
Getting (more) comfortable with statistics
This summer, I decided to get better at statistics. My relationship with statistics was the kind where we could stare at each other and maybe get an intuition of what was going on, but not much more than that. Anything that involved, for example, understanding how a Pearson correlation matrix is calculated, and what the math behind it means, gave me the chills, since my foundations were really weak.
So, this summer I decided to join the Master in Analysis and Engineering of Big Data at FCT NOVA - at least partially, since working at Feedzai still takes up most of my time.
The courses I’ve taken are:
Multivariate Stats
You can find my online book of this course here: Multivariate Stats
The goal of this course is to familiarize students with inference on multivariate means and covariance matrices, as well as with linear models for Gaussian populations and dimensionality reduction techniques. This knowledge is then applied to data discrimination and classification.
At this point, it has made me more comfortable with matrix operations, as well as with methods’ assumptions of normality.
For example, let’s talk a little bit about the determinant of a matrix.
Matrix Determinant
It was while studying this subject that I got a grasp of what the determinant of a matrix means, semantically - kudos to 3Blue1Brown for his amazing job explaining that!
Let me try to summarise this in a few lines of code and plots.
Assume that we have the following data, whose covariance matrix we compute below:
library(ggplot2)
library(plotly)    # ggplotly(), layout(), subplot()
library(magrittr)  # the %>% pipe

set.seed(1)
df <- data.frame(
  v1 = rnorm(20, mean = 4, sd = 2),
  v2 = rchisq(20, df = 2)
)

plot <- ggplot(df, aes(v1, v2))
plotly::ggplotly(plot + stat_density2d(geom = "tile", aes(fill = ..density..), contour = FALSE) + geom_point(colour = "white"))

dt.cov <- cov(df)
kableExtra::kable(round(dt.cov, 2)) %>% kableExtra::kable_styling(position = "center")
| | v1 | v2 |
|---|---|---|
| v1 | 3.34 | 0.08 |
| v2 | 0.08 | 2.86 |
We can interpret this matrix as:
- The first variable (\(v1\)) has a variance of 3.34
- The second variable (\(v2\)) has a variance of 2.86
- The first and second variable have a covariance of 0.08
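For reference, each entry of this matrix is a sample (co)variance, with the diagonal holding each variable’s variance:

\[ \operatorname{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]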
So, \(v1\) and \(v2\) do not covary much. Another way to see this is to normalize the covariance by the product of the two variables’ standard deviations - also known as the Pearson correlation:
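In symbols:

\[ r_{v_1 v_2} = \frac{\operatorname{cov}(v_1, v_2)}{s_{v_1} s_{v_2}} \]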
dt.cor <- cor(df)
kableExtra::kable(round(dt.cor,2)) %>% kableExtra::kable_styling(position = "center")
| | v1 | v2 |
|---|---|---|
| v1 | 1.00 | 0.03 |
| v2 | 0.03 | 1.00 |
So we can confirm that, at least in a linear sense, they are practically uncorrelated. Another way to say this is that these two variables together carry more information than either one alone.
Now we can visualize the determinant of this matrix as follows:
drawMatrixWithDet(dt.cor,dim(dt.cor)[1])
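drawMatrixWithDet is a small helper from my notes. As a rough idea of what such a helper can do in the 2×2 case (this is just a sketch under my own assumptions, not the actual implementation), it can draw the parallelogram spanned by the matrix’s column vectors and annotate its area, which is exactly the determinant:

# Hypothetical 2x2 sketch: draw the parallelogram spanned by the
# column vectors of m; its area equals det(m)
drawDet2d <- function(m) {
  px <- c(0, m[1, 1], m[1, 1] + m[1, 2], m[1, 2])
  py <- c(0, m[2, 1], m[2, 1] + m[2, 2], m[2, 2])
  ggplot(data.frame(x = px, y = py), aes(x, y)) +
    geom_polygon(fill = "steelblue", alpha = 0.5) +
    labs(title = paste("Area = det =", round(det(m), 3)))
}
drawDet2d(dt.cor)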
You can imagine this area as the “area of information” that the matrix contains.
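For a 2×2 correlation matrix the determinant is \(1 - r^2\), so here the area is almost the full unit square:

det(dt.cor)  # 1 - 0.03^2, i.e. roughly 0.999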
Let’s now see another example, this time with variables that are a bit more correlated:
set.seed(1)
v <- rnorm(20, mean = 0, sd = 2)
df2 <- data.frame(
  v1 = v,
  v2 = v * 0.3 + rnorm(20, 0, 0.3)  # v2 is mostly a scaled copy of v1, plus noise
)
plot <- ggplot(df2, aes(v1, v2))
plotly::ggplotly(plot + stat_density2d(geom="tile", aes(fill = ..density..), contour = FALSE) + geom_point(colour = "white"))
Yup, that linear pattern really indicates some covariance!
dt2.cov <- cov(df2)
kableExtra::kable(round(dt2.cov,2)) %>% kableExtra::kable_styling(position = "center")
| | v1 | v2 |
|---|---|---|
| v1 | 3.34 | 0.90 |
| v2 | 0.90 | 0.31 |
Hm… but the covariance matrix is not very expressive about this, since covariance depends on the variables’ scales. Let’s check the correlation matrix, which will tell us for sure:
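dt2.cor <- cor(df2)
kableExtra::kable(round(dt2.cor, 2)) %>% kableExtra::kable_styling(position = "center")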
| | v1 | v2 |
|---|---|---|
| v1 | 1.00 | 0.89 |
| v2 | 0.89 | 1.00 |
And here it is! They seem to be pretty correlated!
So, how does the “area of information” of this matrix compare to the previous one?
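We can check directly with the determinants: the stronger the correlation, the smaller the area.

det(dt.cor)   # ≈ 1 - 0.03^2 ≈ 0.999: almost the full unit square
det(dt2.cor)  # ≈ 1 - 0.89^2 ≈ 0.21: the area shrinks as correlation grows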
And how is this useful?
Well, let’s now take an example in 3D!
The data:
set.seed(1)
v <- rnorm(20, mean = 0, sd = 2)
df3 <- data.frame(
  v1 = v,
  v2 = v * 0.3 + abs(rnorm(20, 0, 0.3)),
  v3 = rchisq(20, df = 4)
)
# Axis titles for each panel
ax1 <- list(title = "v1 VS v2")
ax2 <- list(title = "v3 VS v2")
ax3 <- list(title = "v3 VS v1")
p1 <- plotly::ggplotly(ggplot(df3, aes(v1, v2)) + stat_density2d(geom = "tile", aes(fill = ..density..), contour = FALSE) + geom_point(colour = "white")) %>% layout(xaxis = ax1)
p2 <- plotly::ggplotly(ggplot(df3, aes(v3, v2)) + stat_density2d(geom = "tile", aes(fill = ..density..), contour = FALSE) + geom_point(colour = "white")) %>% layout(xaxis = ax2)
p3 <- plotly::ggplotly(ggplot(df3, aes(v3, v1)) + stat_density2d(geom = "tile", aes(fill = ..density..), contour = FALSE) + geom_point(colour = "white")) %>% layout(xaxis = ax3)
plotly::subplot(
  p1, p2, p3, nrows = 1, titleX = TRUE
)
Now we can see that there is a correlation between \(v1\) and \(v2\), but not much between \(v3\) and \(v2\), nor between \(v3\) and \(v1\):
dt3.cor <- cor(df3)
kableExtra::kable(round(dt3.cor,2)) %>% kableExtra::kable_styling(position = "center")
| | v1 | v2 | v3 |
|---|---|---|---|
| v1 | 1.00 | 0.98 | -0.29 |
| v2 | 0.98 | 1.00 | -0.34 |
| v3 | -0.29 | -0.34 | 1.00 |
And now let’s observe the determinant of this matrix:
drawMatrixWithDet(dt3.cor,dim(dt3.cor)[1])
(Please notice that you can rotate the image above, as well as reset the axes, using the controls at the top right corner of the image.)
We can see that the “volume of information” of this matrix is almost a plane instead of a full 3D volume! This is also pretty noticeable when you take into consideration the value of the determinant, which is 0.03. What does that mean? It means that the matrix has more dimensions than necessary: the information it holds could be represented with fewer dimensions.
That, or the matrix holds almost no information at all (which is not the case in this example).
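One way to make this concrete (a quick sketch of my own, not taken from the course materials) is to run a principal component analysis: if the correlation matrix is nearly singular, the first few components should capture almost all of the variance.

det(dt3.cor)  # ≈ 0.03, as mentioned above

# PCA on the standardized variables
pca <- prcomp(df3, scale. = TRUE)
summary(pca)  # expect the first two components to explain nearly all the variance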
Computational Stats
Let’s talk about resampling to get confidence intervals!
You can find my online log of this course here: Computational Stats
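As a tiny taste of the topic (a minimal sketch of the nonparametric bootstrap, under my own assumptions rather than taken from the course materials): resample the data with replacement many times, recompute the statistic of interest each time, and take percentiles of those replicates as a confidence interval.

set.seed(1)
x <- rchisq(50, df = 2)  # a skewed sample, where normal-theory intervals are shaky

# Resample with replacement and recompute the mean each time
boot_means <- replicate(10000, mean(sample(x, replace = TRUE)))

# 95% percentile bootstrap confidence interval for the mean
quantile(boot_means, probs = c(0.025, 0.975))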
All in all
All in all, I believe this has been really useful and I’m learning a lot - more importantly, I’m getting familiar with a lot of terms and concepts in both multivariate and computational statistics, which is awesome! Both courses will end at the beginning of January 2019, so I’m really excited to keep learning while this lasts!